About the project

This project is for the course Introduction to Open Data Science at the University of Helsinki, fall 2018. The learning objectives of the course are to understand the possibilities of open science, open data, and open research tools; to visualize and analyze data sets with some fairly advanced statistical methods; to master a few state-of-the-art, freely available, open software tools; to apply the principle of reproducible research and see its practical advantages; and to keep learning more of all this in the near and distant future ("life-long learning"). My GitHub repository can be found at: https://github.com/ellitaimela/IODS-project


RStudio Exercise 2: Regression Analysis on Students’ Learning Data

The data analysis exercise for the second course week consisted of analysing students' learning data from the course Introduction to Social Statistics, collected between December 2014 and January 2015. The exercise consisted of reading and exploring the structure of the provided dataset, showing a graphical overview and summaries of the data, and finally creating and analysing a regression model of the data.

  1. Reading and exploring the data

I retrieved the data provided for this exercise from AWS. The data, stored in a .txt file, described students' learning outcomes in the course Introduction to Social Statistics held in fall 2014. It included 166 observations of 7 variables: the students' gender, age, and attitude towards the course; their scores in exam/exercise questions related to deep, strategic, and surface learning; and the exam points granted to the students.

# Reading the data
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", 
                    sep=",", header=TRUE)
# The data includes students' learning data from the course Introduction to Social Statistics, 
# collected during 3.12.2014 - 10.1.2015

# Exploring the structure and dimensions of the data: 166 observations of 7 variables
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(lrn14)
## [1] 166   7
# Summaries of the variables
summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
  2. Creating a graphical overview of the data

I created a graphical overview of the distributions of the variables and their relationships. First, I drew a boxplot of the whole dataset to compare the distributions of the variables. Second, I created a plot matrix with ggpairs to get an overall picture of the relationships between the variables. After this, I examined the distributions of the variables in more detail: I showed the distribution of students by gender with a barplot, and for the rest of the variables I examined the distributions with histograms. I also analyzed the relationships between the variables with qplots and boxplots.

library(ggplot2)
library(GGally)

# Plotting the distributions of the variables
boxplot(lrn14)

# Creating a plot matrix with ggpairs
ggpairs(lrn14, lower = list(combo = wrap("facethist", bins = 20)))

# 110 females, 56 males
barplot(table(lrn14$gender))

# Age distribution from 17 to 55, median 22, mean 25.51, standard deviation 7.766078
hist(lrn14$age)

# Attitude from 1 to 5, median 3.2, mean 3.14, standard deviation 0.5327656
hist(lrn14$attitude)

# Points of deep learning questions from 1 to 5, median 3.667, mean 3.680, sd 0.5541369
hist(lrn14$deep)

# Points of strategic learning questions from 1 to 5, median 3.188, mean 3.121, sd 0.7718318
hist(lrn14$stra)

# Points of surface learning questions from 1 to 5, median 2.833, mean 2.787, sd 0.5288405
hist(lrn14$surf)

# Points granted altogether from 7 to 33, median 23.00, mean 22.72, sd 5.894884
hist(lrn14$points)

# A positive correlation between attitude and points can be seen
qplot(attitude, points, data = lrn14) + geom_smooth(method = "lm")

qplot(age, points, data = lrn14) + geom_smooth(method = "lm")

qplot(surf, points, data = lrn14) + geom_smooth(method = "lm")

# There is no correlation between age and attitude
qplot(age, attitude, data = lrn14) + geom_smooth(method = "lm")

# The correlation of age with points achieved in strategic and deep questions looks minimal
qplot(age, stra, data = lrn14) + geom_smooth(method = "lm")

qplot(age, deep, data = lrn14) + geom_smooth(method = "lm")

# Some negative correlation can be seen between age and points achieved in surface questions - statistical significance not analyzed
qplot(age, surf, data = lrn14) + geom_smooth(method = "lm")

# On average, men achieved marginally higher points compared to women - statistical significance not analyzed
plot(lrn14$points ~ lrn14$gender)

# On average, women achieved higher points in strategic and surface learning questions - statistical significance not analyzed
plot(lrn14$stra ~ lrn14$gender)

plot(lrn14$deep ~ lrn14$gender)

plot(lrn14$surf ~ lrn14$gender)

  3. Creating a regression model

I started building the regression model by choosing three variables - attitude, stra, and surf - as explanatory variables and fitting a regression model with exam points as the dependent variable. I chose these variables because they had the highest correlations with points, as can be seen in the ggpairs matrix.

# Fitting a linear model
model1 <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Of the explanatory variables, only attitude turned out to be statistically significant, with a p-value of 1.93e-08. The p-values of stra and surf were not statistically significant, so I excluded them from further analysis. I eventually tested combinations of all variables in the dataset, but found no statistically significant explanatory variable other than attitude. Therefore, I created a new regression model with attitude as the only explanatory variable. Statistically significant (p < 0.0001) estimates were found for both the intercept and attitude.
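An alternative to refitting from scratch is R's update(), which modifies an existing model formula in place. A minimal sketch with toy data (the variables y, a, and b are hypothetical, not from the lrn14 set):

```r
# Dropping a non-significant term with update() instead of retyping the formula
set.seed(1)
d <- data.frame(y = rnorm(50), a = rnorm(50), b = rnorm(50))
full    <- lm(y ~ a + b, data = d)
reduced <- update(full, . ~ . - b)  # equivalent to lm(y ~ a, data = d)
names(coef(reduced))                # "(Intercept)" "a"
```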

# Fitting a new model with attitude as the only explanatory variable
model2 <- lm(points ~ attitude, data = lrn14)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
  4. Relationship between variables and goodness of fit

As can be seen in the summary of the regression model above (model2), exam points are positively associated with one's attitude. This is intuitive, because a high (positive) attitude towards a topic typically means higher interest and motivation to learn, which typically leads to investing more time and effort in the topic. From the summary, we can see that a one-unit increase in attitude (scale from 1 to 5) increases points by 3.5255 on average.
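That interpretation of the slope can be checked with simple arithmetic using the coefficients reported in summary(model2) above:

```r
# Predicted exam points from the fitted line: points = 11.6372 + 3.5255 * attitude
b0 <- 11.6372
b1 <- 3.5255
pred <- function(attitude) b0 + b1 * attitude
pred(4) - pred(3)  # 3.5255: a one-unit increase in attitude adds 3.5255 points
```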

The multiple R-squared describes how close the data are to the fitted regression line of the model. R-squared takes values from 0 to 1 (or 0 to 100 %) and gives the proportion of the variation in the target variable that is explained by the model. The multiple R-squared of model2 is 0.1906, which means that only about 19 % of the variation in exam points is explained by attitude.
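The definition can be verified by hand: R-squared equals 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with simulated data (not the course data):

```r
# Computing R-squared manually and comparing it to the value lm() reports
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)            # residual sum of squares
tss <- sum((y - mean(y))^2)             # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)   # TRUE
```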

  5. Validity of assumptions of the model

One natural assumption of the linear regression model created above is linearity. Another assumption is that the errors are normally distributed, uncorrelated, and have a constant variance. To analyze whether these assumptions are actually valid, we can take a closer look at the residuals of the model.

The normality of the errors can be analyzed with the Q-Q plot. As can be seen, a clear majority of the residuals follow the line; only the residuals at very low and very high values deviate from it. Therefore we can assume the error terms to be mostly normally distributed.

By plotting the residuals against the fitted values of the model, we can see whether there is any pattern in how the residuals change, and deduce whether the variance of the residuals is constant. Looking at the Residuals vs. Fitted plot below, there does not seem to be a clear pattern. Therefore we can consider the assumption of constant variance reasonably valid.

By examining the Residuals vs. Leverage plot below, we can also analyze the impact of single observations on the model. As can be seen, no single observation stands out clearly, and the leverage values of all observations are low. Therefore we can conclude that no single observation has an excessive impact on the model.

par(mfrow = c(2,2))

# Visualizing Residuals vs. Fitted values
plot(model2, which = 1)

# Visualizing a normal QQ-plot
plot(model2, which = 2)

# Visualizing Residuals vs. Leverage
plot(model2, which = 5)


RStudio Exercise 3: Logistic Regression

The data analysis exercise for the third course week consisted of analysing the relationship between students’ alcohol consumption and learning outcomes.

  1. Creating the markdown file

A new markdown file ‘chapter3.Rmd’ was created and included as a child file in the ‘index.Rmd’ file.

  2. Reading and exploring the data

I read the data formulated in the Data Wrangling part of this week’s exercise, and checked the structure, dimensions and summary of the dataset. The data describes student achievement in secondary education at two Portuguese schools. The data attributes included student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets were provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
library(tidyr)

# Setting the working directory
setwd("/Users/ellitaimela/Documents/Github/IODS-project/Data")
# Reading the data
alc <- read.csv("alc.csv", sep=",")
# This data describes student achievement in secondary education at two Portuguese schools.
# The data attributes include student grades, demographic, social and school related features,
# and the data was collected by using school reports and questionnaires. Two datasets were provided
# regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).

# Checking the data, and the structure, dimensions and summary of the dataset
# alc
str(alc)
## 'data.frame':    382 obs. of  36 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382  36
summary(alc)
##        X          school   sex          age        address famsize  
##  Min.   :  1.00   GP:342   F:198   Min.   :15.00   R: 81   GT3:278  
##  1st Qu.: 96.25   MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104  
##  Median :191.50                    Median :17.00                    
##  Mean   :191.50                    Mean   :16.59                    
##  3rd Qu.:286.75                    3rd Qu.:17.00                    
##  Max.   :382.00                    Max.   :22.00                    
##  Pstatus      Medu            Fedu             Mjob           Fjob    
##  A: 38   Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  T:344   1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##          Median :3.000   Median :3.000   other   :138   other   :211  
##          Mean   :2.806   Mean   :2.565   services: 96   services:107  
##          3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##          Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 
  3. Hypotheses of the relationships between high/low alcohol consumption and other variables

I chose 4 variables to study their relationships with high/low alcohol consumption. The variables I chose were famrel (quality of family relationships), goout (going out with friends), absences (number of school absences), and G3 (final grade). My initial hypothesis was that students who have worse relationships with their family tend to feel worse and consume more alcohol. I also hypothesized that going out with friends and the number of school absences correlate positively with alcohol consumption, and that the final grade correlates negatively with alcohol consumption.

  4. Variables’ relationships with alcohol consumption

I drew boxplots of the distributions of the chosen variables, divided into high and low alcohol consumption. As can be seen, the results are in line with my initial hypotheses. Average grades are higher for students who consume less alcohol, and students who have better relationships with their family members consume less alcohol. Students who go out with friends more or have more absences also consume more alcohol. The numerical differences between the group means of the variables can also be seen below. However, statistical significance cannot be deduced from these results.

# initialize a plot of high_use and family relationships
g1 <- ggplot(alc, aes(x = high_use, y = famrel))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relationships")

# initialize a plot of high_use and going out
g2 <- ggplot(alc, aes(x = high_use, y = goout))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("going out")

# initialize a plot of high_use and absences
g3 <- ggplot(alc, aes(x = high_use, y = absences))

# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("absences")

# initialize a plot of high_use and G3
g4 <- ggplot(alc, aes(x = high_use, y = G3))

# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ylab("grade")

# produce summary statistics by group
alc %>% group_by(high_use) %>% summarise(count = n(), famrel = mean(famrel), goout = mean(goout), absences = mean(absences), mean_grade = mean(G3))
## # A tibble: 2 x 6
##   high_use count famrel goout absences mean_grade
##   <lgl>    <int>  <dbl> <dbl>    <dbl>      <dbl>
## 1 FALSE      268   4.00  2.85     3.71       11.7
## 2 TRUE       114   3.78  3.72     6.37       10.8
  5. Using logistic regression

I created a logistic regression model m with the boolean variable high_use as the target variable; high_use is TRUE if a student's alcohol consumption is high. The explanatory variables are the previously chosen variables: famrel, goout, absences, and G3. According to the results (see the output of summary(m)), going out and the number of absences are positively associated with high alcohol consumption at a very significant level (p < 0.001). Family relationships (1 = bad, 5 = good) are negatively associated with alcohol consumption (p < 0.05). The relationship between alcohol consumption and the final grade G3 was not statistically significant (p = 0.24).

I combined and printed out the odds ratios and their confidence intervals. As can be seen from the printout below, the predictor goout has the widest confidence interval. The interval of G3 contains the value 1, with an odds ratio of 0.95 - therefore it is hard to determine whether G3 is positively or negatively associated with high_use. The odds ratios and confidence intervals show that goout has the strongest (positive) association with high_use. Famrel is negatively associated with high_use (the whole confidence interval is below 1). Absences are positively associated with high_use, but only slightly, since the confidence interval lies between 1.032 and 1.125.
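The odds ratios are simply the exponentials of the logistic regression coefficients; for example, the goout coefficient 0.75149 reported in summary(m) above:

```r
# exp() maps a log-odds coefficient to an odds ratio
or_goout <- exp(0.75149)
round(or_goout, 2)  # 2.12: each extra point on 'goout' roughly doubles the odds of high_use
```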

# find the model with glm()
m <- glm(high_use ~ famrel + goout + absences + G3, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9389  -0.7743  -0.5392   0.9022   2.3966  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.83044    0.78397  -2.335 0.019552 *  
## famrel      -0.34483    0.13563  -2.542 0.011010 *  
## goout        0.75149    0.12033   6.245 4.23e-10 ***
## absences     0.07261    0.02174   3.340 0.000839 ***
## G3          -0.04482    0.03843  -1.166 0.243529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 394.43  on 377  degrees of freedom
## AIC: 404.43
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)      famrel       goout    absences          G3 
## -1.83043900 -0.34482682  0.75148543  0.07261377 -0.04481705
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 0.1603432 0.0331851 0.724006
## famrel      0.7083430 0.5415610 0.923423
## goout       2.1201470 1.6848948 2.703401
## absences    1.0753151 1.0316028 1.124827
## G3          0.9561724 0.8866759 1.031252
  6. Predictive power of the model

I refined my model by excluding the final grade G3, because its relationship with alcohol consumption was not statistically significant.

As can be seen below from the 2x2 cross tabulation, 69 of the (244 + 69 =) 313 students who were predicted to be low users were actually high users of alcohol, and 24 of the (24 + 45 =) 69 students who were predicted to be high users were actually low users. Therefore (69 + 24) / 382 = 0.243455, i.e. about 24.3 percent of the individuals were classified incorrectly. The same result can be obtained with a loss function similar to the one in the DataCamp exercise. As can be seen, the predictive power is not perfect, but it is clearly better than a completely random guessing strategy, where the probability of guessing right between two options would be only 50 percent.
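The misclassification rate can be verified directly from the counts of the cross tabulation:

```r
# Confusion-matrix counts from the 2x2 table of high_use vs. prediction
tn <- 244  # low users predicted low
fp <- 24   # low users predicted high
fn <- 69   # high users predicted low
tp <- 45   # high users predicted high
(fp + fn) / (tn + fp + fn + tp)  # 0.2434555, i.e. about 24.3 % misclassified
```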

# find the model with glm()
m2 <- glm(high_use ~ famrel + goout + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(m2)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9018  -0.7731  -0.5466   0.9002   2.4180  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.38825    0.63161  -3.781 0.000156 ***
## famrel      -0.34986    0.13534  -2.585 0.009735 ** 
## goout        0.77071    0.11981   6.433 1.25e-10 ***
## absences     0.07446    0.02174   3.425 0.000615 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 395.79  on 378  degrees of freedom
## AIC: 403.79
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m2)
## (Intercept)      famrel       goout    absences 
## -2.38824753 -0.34985642  0.77071403  0.07445633
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   24
##    TRUE     69   45
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63874346 0.06282723 0.70157068
##    TRUE  0.18062827 0.11780105 0.29842932
##    Sum   0.81937173 0.18062827 1.00000000
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2434555
  7. Cross-validation

According to a 10-fold cross-validation of the model m2, the prediction error on the test sets is around 0.23 to 0.25, which is smaller than the error of the model introduced in DataCamp (circa 0.26). The code and results can be seen below.

library(boot)

# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486911

RStudio Exercise 4: Clustering and classification

The data analysis exercise for the fourth course week consisted of analysing crime rates in the suburbs of Boston.

  1. Creating the markdown file

A new markdown file ‘chapter4.Rmd’ was created and included as a child file in the ‘index.Rmd’ file.

  2. Loading and exploring the data

I read the Boston dataset, which includes housing values in the suburbs of Boston. The columns include, for example, the per capita crime rate, the proportion of non-retail business acres, and the weighted mean of distances to five Boston employment centres by town. The dimensions of the data are 506 x 14.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(GGally)
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.5
## v readr   1.1.1     v stringr 1.3.1
## v tibble  1.4.2     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)

# Loading the data
data(Boston)

# Checking the data, and the structure, dimensions and summary of the dataset
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
  3. Graphical overview of the data

I created a graphical overview of the data. As can be seen from the ggpairs matrix and the summary statistics, the distributions vary between the variables. Crim, for example, resembles a Poisson-like distribution, and rm a normal distribution. Many of the distributions are left- or right-skewed.

The correlations between the variables seem logical. For example, the crime rate, the proportion of nearby industrial areas, the nitrogen oxides concentration, the high age of the buildings, the accessibility to radial highways, the property tax rate, the pupil-teacher ratio, and the lower status of the population correlate negatively with the median value of owner-occupied homes within a town. The Charles River dummy variable has the smallest absolute correlations with the other variables.

# Graphical overview of the data
pairs(Boston)

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

# Summaries of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
  1. Standardization of the dataset

I standardized the dataset with the scale function. After scaling, the mean of each variable is 0 and its standard deviation is 1, while the shape of each distribution is preserved. I also created a categorical crime rate variable from the scaled crime rate, using the quantiles as the break points of the categorical variable. The original crime rate variable was then dropped from the dataset. Finally, I divided the dataset into train and test sets so that 80 percent of the observations belong to the train set and 20 percent to the test set.
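To make explicit what the scale function does (a minimal sketch; the name scaled is used here to avoid clashing with the chunk below), each column is centered by its mean and divided by its standard deviation:

```r
library(MASS)
data(Boston)

scaled <- scale(Boston)  # returns a matrix; each column now has mean 0 and sd 1

# scale() is equivalent to (x - mean(x)) / sd(x), column by column
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(scaled[, "crim"]), manual)
# TRUE
```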

boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summary of the scaled dataset
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
label_vector <- c("low", "med_low", "med_high", "high")

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = label_vector) 

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove the original crime rate variable crim and add crime
boston_scaled <- subset(boston_scaled, select = -c(crim))
boston_scaled$crime <- crime
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows (note: without set.seed() the split is not reproducible)
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
  1. Linear discriminant analysis

I fit a linear discriminant analysis (LDA) model on the train set, using the categorical crime rate as the target variable and all the other variables in the dataset as predictors. After fitting, I drew the LDA biplot.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2500000 0.2524752 0.2450495 
## 
## Group means:
##                   zn      indus        chas        nox         rm
## low       1.04057307 -0.9158927 -0.07933396 -0.8849085  0.5228532
## med_low  -0.07969513 -0.2767872 -0.03844192 -0.5712067 -0.1312078
## med_high -0.38551208  0.1370886  0.26805724  0.3356399  0.1445756
## high     -0.48724019  1.0171737 -0.03371693  1.0386422 -0.3698266
##                 age        dis        rad        tax      ptratio
## low      -0.8949757  0.8783621 -0.7082658 -0.7567536 -0.523784528
## med_low  -0.3107823  0.3748689 -0.5440892 -0.4646832 -0.006443321
## med_high  0.4252114 -0.3758448 -0.4132674 -0.3270467 -0.226263117
## high      0.7989934 -0.8340679  1.6375616  1.5136504  0.780117019
##               black       lstat        medv
## low       0.3737511 -0.78426915  0.61981760
## med_low   0.3091010 -0.14065827 -0.02466708
## med_high  0.1065346  0.01940936  0.18958840
## high     -0.7808641  0.86941597 -0.64595193
## 
## Coefficients of linear discriminants:
##                   LD1          LD2         LD3
## zn       0.0806145506  0.785768591 -0.90194935
## indus    0.0003166139 -0.129250049  0.31497574
## chas    -0.0838745716 -0.081220574  0.04910057
## nox      0.4011851216 -0.716609258 -1.40016652
## rm      -0.0946999313 -0.062982691 -0.22278107
## age      0.3010893364 -0.355532119 -0.04893558
## dis     -0.0194243246 -0.363184403  0.25444871
## rad      3.1415001222  1.013436267 -0.08687102
## tax      0.0545009694 -0.144443911  0.65013377
## ptratio  0.0855368176  0.024988320 -0.25696948
## black   -0.1169816299 -0.006244425  0.10238013
## lstat    0.2863303041 -0.246798853  0.11824671
## medv     0.2274576490 -0.321434628 -0.29175433
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9495 0.0367 0.0138
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen=2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

  1. Predicting with the LDA model

I saved the crime categories from the test set and then removed the categorical crime variable from the test data. After this, I predicted the classes on the test data with the LDA model. Cross-tabulating the predictions against the saved crime categories shows that the model predicts high crime rates much more accurately than low ones.

# save the correct classes from test data
correct_classes <- test$crime

test <- subset(test, select = -c(crime))

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
str(lda.pred)
## List of 3
##  $ class    : Factor w/ 4 levels "low","med_low",..: 2 2 3 3 2 2 2 2 1 1 ...
##  $ posterior: num [1:102, 1:4] 0.20532 0.10012 0.01338 0.00623 0.09736 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:102] "2" "7" "8" "9" ...
##   .. ..$ : chr [1:4] "low" "med_low" "med_high" "high"
##  $ x        : num [1:102, 1:3] -3.182 -1.862 -1.2 -0.903 -1.983 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:102] "2" "7" "8" "9" ...
##   .. ..$ : chr [1:3] "LD1" "LD2" "LD3"
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      12        0    0
##   med_low    5      15        5    0
##   med_high   0       6       17    1
##   high       0       0        0   28
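To quantify how well each class is predicted, per-class recall can be computed from the cross-tabulation (a sketch; the counts below are copied from the printed table above):

```r
# Confusion matrix copied from the cross-tabulation above (rows = true classes)
conf <- matrix(c(13, 12,  0,  0,
                  5, 15,  5,  0,
                  0,  6, 17,  1,
                  0,  0,  0, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# Share of correctly classified observations per true class
round(diag(conf) / rowSums(conf), 2)
# low 0.52, med_low 0.60, med_high 0.71, high 1.00
```

This confirms the observation above: the high class is predicted perfectly, while only about half of the low observations are classified correctly.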
  1. K-means algorithm

I reloaded the Boston dataset, standardized it again, calculated the distances between the observations, and ran the k-means algorithm on the scaled data. Visualizing the clusters with pairs suggests that three clusters separate the observations reasonably well.

# load MASS and Boston
library(MASS)
data('Boston')

boston_scaled.new <- scale(Boston)

# euclidean distance matrix
dist_eu <- dist(boston_scaled.new)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled.new, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering on the scaled data
km <- kmeans(boston_scaled.new, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

pairs(Boston[6:10], col = km$cluster)
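The choice of three clusters can be justified with the usual elbow heuristic (a sketch, not part of the original analysis): plot the total within-cluster sum of squares (WCSS) for a range of k and look for the point where the curve bends.

```r
library(MASS)
data(Boston)
boston_scaled.new <- scale(Boston)

set.seed(123)  # k-means uses random starting centers
twcss <- sapply(1:10, function(k)
  kmeans(boston_scaled.new, centers = k, nstart = 10)$tot.withinss)

# WCSS drops sharply at first and then flattens; the "elbow" suggests a good k
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```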


RStudio Exercise 6: Analysis of longitudinal data

In the final week we analysed longitudinal data. We started with graphical displays and the summary measure approach, and continued with linear mixed effects models for normal response variables.

Data Wrangling

The code for data wrangling can be found at my Github repository. The code without prints or graphs is implemented below to enable further analysis.

library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", 
                 sep=" ", header=TRUE)

rats <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", 
                 sep="\t", header=TRUE)

# Factor treatment & subject
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

rats$ID <- factor(rats$ID)
rats$Group <- factor(rats$Group)

# Convert to long form
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
# Extract the week number
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks, 5, 5)))


# Convert to long form
ratsl <-  rats %>% gather(key = WDs, value = weight, -ID, -Group)
# Extract the wd
ratsl <-  ratsl %>% mutate(WD = as.integer(substr(WDs, 3, 4)))
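The wide-to-long conversion above can be sanity-checked on a toy example (hypothetical values; toy is introduced only for illustration):

```r
library(tidyr)
library(dplyr)

# Toy wide-format data in the same shape as rats (values are made up)
toy <- data.frame(ID = factor(1:2), Group = factor(c(1, 2)),
                  WD1 = c(240, 410), WD8 = c(250, 415))

toy_long <- toy %>%
  gather(key = WDs, value = weight, -ID, -Group) %>%
  mutate(WD = as.integer(substr(WDs, 3, 4)))

# 2 subjects x 2 measurement columns -> 4 long-form rows
stopifnot(nrow(toy_long) == nrow(toy) * (ncol(toy) - 2))
toy_long
```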

Part I: Graphical Displays and Summary Method Approach

The data included observations of 16 subjects (rats) divided into 3 groups. With the data in long form, we can examine the individual response profiles by group by plotting each subject's weight across time with ggplot. We can see that the rats in Group 1 (n=8) have lower weights than those in Groups 2 (n=4) and 3 (n=4). The weights in Group 2 have the highest variance: one rat is considerably heavier than the others.

ratsl
##     ID Group  WDs weight WD
## 1    1     1  WD1    240  1
## 2    2     1  WD1    225  1
## 3    3     1  WD1    245  1
## 4    4     1  WD1    260  1
## 5    5     1  WD1    255  1
## 6    6     1  WD1    260  1
## 7    7     1  WD1    275  1
## 8    8     1  WD1    245  1
## 9    9     2  WD1    410  1
## 10  10     2  WD1    405  1
## 11  11     2  WD1    445  1
## 12  12     2  WD1    555  1
## 13  13     3  WD1    470  1
## 14  14     3  WD1    535  1
## 15  15     3  WD1    520  1
## 16  16     3  WD1    510  1
## 17   1     1  WD8    250  8
## 18   2     1  WD8    230  8
## 19   3     1  WD8    250  8
## 20   4     1  WD8    255  8
## 21   5     1  WD8    260  8
## 22   6     1  WD8    265  8
## 23   7     1  WD8    275  8
## 24   8     1  WD8    255  8
## 25   9     2  WD8    415  8
## 26  10     2  WD8    420  8
## 27  11     2  WD8    445  8
## 28  12     2  WD8    560  8
## 29  13     3  WD8    465  8
## 30  14     3  WD8    525  8
## 31  15     3  WD8    525  8
## 32  16     3  WD8    510  8
## 33   1     1 WD15    255 15
## 34   2     1 WD15    230 15
## 35   3     1 WD15    250 15
## 36   4     1 WD15    255 15
## 37   5     1 WD15    255 15
## 38   6     1 WD15    270 15
## 39   7     1 WD15    260 15
## 40   8     1 WD15    260 15
## 41   9     2 WD15    425 15
## 42  10     2 WD15    430 15
## 43  11     2 WD15    450 15
## 44  12     2 WD15    565 15
## 45  13     3 WD15    475 15
## 46  14     3 WD15    530 15
## 47  15     3 WD15    530 15
## 48  16     3 WD15    520 15
## 49   1     1 WD22    260 22
## 50   2     1 WD22    232 22
## 51   3     1 WD22    255 22
## 52   4     1 WD22    265 22
## 53   5     1 WD22    270 22
## 54   6     1 WD22    275 22
## 55   7     1 WD22    270 22
## 56   8     1 WD22    268 22
## 57   9     2 WD22    428 22
## 58  10     2 WD22    440 22
## 59  11     2 WD22    452 22
## 60  12     2 WD22    580 22
## 61  13     3 WD22    485 22
## 62  14     3 WD22    533 22
## 63  15     3 WD22    540 22
## 64  16     3 WD22    515 22
## 65   1     1 WD29    262 29
## 66   2     1 WD29    240 29
## 67   3     1 WD29    262 29
## 68   4     1 WD29    265 29
## 69   5     1 WD29    270 29
## 70   6     1 WD29    275 29
## 71   7     1 WD29    273 29
## 72   8     1 WD29    270 29
## 73   9     2 WD29    438 29
## 74  10     2 WD29    448 29
## 75  11     2 WD29    455 29
## 76  12     2 WD29    590 29
## 77  13     3 WD29    487 29
## 78  14     3 WD29    535 29
## 79  15     3 WD29    543 29
## 80  16     3 WD29    530 29
## 81   1     1 WD36    258 36
## 82   2     1 WD36    240 36
## 83   3     1 WD36    265 36
## 84   4     1 WD36    268 36
## 85   5     1 WD36    273 36
## 86   6     1 WD36    277 36
## 87   7     1 WD36    274 36
## 88   8     1 WD36    265 36
## 89   9     2 WD36    443 36
## 90  10     2 WD36    460 36
## 91  11     2 WD36    455 36
## 92  12     2 WD36    597 36
## 93  13     3 WD36    493 36
## 94  14     3 WD36    540 36
## 95  15     3 WD36    546 36
## 96  16     3 WD36    538 36
## 97   1     1 WD43    266 43
## 98   2     1 WD43    243 43
## 99   3     1 WD43    267 43
## 100  4     1 WD43    270 43
## 101  5     1 WD43    274 43
## 102  6     1 WD43    278 43
## 103  7     1 WD43    276 43
## 104  8     1 WD43    265 43
## 105  9     2 WD43    442 43
## 106 10     2 WD43    458 43
## 107 11     2 WD43    451 43
## 108 12     2 WD43    595 43
## 109 13     3 WD43    493 43
## 110 14     3 WD43    525 43
## 111 15     3 WD43    538 43
## 112 16     3 WD43    535 43
## 113  1     1 WD44    266 44
## 114  2     1 WD44    244 44
## 115  3     1 WD44    267 44
## 116  4     1 WD44    272 44
## 117  5     1 WD44    273 44
## 118  6     1 WD44    278 44
## 119  7     1 WD44    271 44
## 120  8     1 WD44    267 44
## 121  9     2 WD44    446 44
## 122 10     2 WD44    464 44
## 123 11     2 WD44    450 44
## 124 12     2 WD44    595 44
## 125 13     3 WD44    504 44
## 126 14     3 WD44    530 44
## 127 15     3 WD44    544 44
## 128 16     3 WD44    542 44
## 129  1     1 WD50    265 50
## 130  2     1 WD50    238 50
## 131  3     1 WD50    264 50
## 132  4     1 WD50    274 50
## 133  5     1 WD50    276 50
## 134  6     1 WD50    284 50
## 135  7     1 WD50    282 50
## 136  8     1 WD50    273 50
## 137  9     2 WD50    456 50
## 138 10     2 WD50    475 50
## 139 11     2 WD50    462 50
## 140 12     2 WD50    612 50
## 141 13     3 WD50    507 50
## 142 14     3 WD50    543 50
## 143 15     3 WD50    553 50
## 144 16     3 WD50    550 50
## 145  1     1 WD57    272 57
## 146  2     1 WD57    247 57
## 147  3     1 WD57    268 57
## 148  4     1 WD57    273 57
## 149  5     1 WD57    278 57
## 150  6     1 WD57    279 57
## 151  7     1 WD57    281 57
## 152  8     1 WD57    274 57
## 153  9     2 WD57    468 57
## 154 10     2 WD57    484 57
## 155 11     2 WD57    466 57
## 156 12     2 WD57    618 57
## 157 13     3 WD57    518 57
## 158 14     3 WD57    544 57
## 159 15     3 WD57    555 57
## 160 16     3 WD57    553 57
## 161  1     1 WD64    278 64
## 162  2     1 WD64    245 64
## 163  3     1 WD64    269 64
## 164  4     1 WD64    275 64
## 165  5     1 WD64    280 64
## 166  6     1 WD64    281 64
## 167  7     1 WD64    284 64
## 168  8     1 WD64    278 64
## 169  9     2 WD64    478 64
## 170 10     2 WD64    496 64
## 171 11     2 WD64    472 64
## 172 12     2 WD64    628 64
## 173 13     3 WD64    525 64
## 174 14     3 WD64    559 64
## 175 15     3 WD64    548 64
## 176 16     3 WD64    569 64
dim(rats)
## [1] 16 13
summary(rats)
##        ID     Group      WD1             WD8             WD15      
##  1      : 1   1:8   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  2      : 1   2:4   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  3      : 1   3:4   Median :340.0   Median :345.0   Median :347.5  
##  4      : 1         Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  5      : 1         3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  6      : 1         Max.   :555.0   Max.   :560.0   Max.   :565.0  
##  (Other):10                                                        
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##                                                                 
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0  
## 
dim(ratsl)
## [1] 176   5
names(ratsl)
## [1] "ID"     "Group"  "WDs"    "weight" "WD"
str(ratsl)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WDs   : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD    : int  1 1 1 1 1 1 1 1 1 1 ...
summary(ratsl)
##        ID      Group      WDs                weight            WD       
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110
ggplot(ratsl, aes(x = WD, y = weight, linetype = ID)) + geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) + facet_grid(.~Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(ratsl$weight), max(ratsl$weight)))

We can take tracking into account by standardizing the data and plotting the standardized responses. Standardization makes comparing individuals within and between groups easier. The plot below shows that the same individuals stand out as before.

The individuals in Group 1 seem to gain weight quite steeply compared to at least Group 2. One individual whose weight dramatically drops in the beginning can be pointed out.

The individuals in Group 2 are not gaining weight as much compared to other groups. However, the growth is steadier.

The individuals in Group 3 are gaining weight more than Group 2 and more steadily than individuals in Group 1. All individuals in this group gain weight.

# Standardized weight plots by group
ratsl <- ratsl %>% group_by(Group) %>% mutate(weights_standardized = (weight - mean(weight))/sd(weight)) %>% ungroup()

ggplot(ratsl, aes(x = WD, y = weights_standardized, linetype = ID)) + geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) + facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

Next we analyse the average response profiles of each group. In the plot below, the standard error of the mean indicates the variability around each group mean. We can see that the mean profiles and their error bars do not overlap. The mean starting weights of the groups differ, with Groups 2 and 3 closer to each other, as noted before. Group 2 again shows the highest variance and Group 1 the smallest. The average weight grew the most in Group 2.

# Summary measure analysis of longitudinal data
# The se of the mean divides by the number of rats per group (n()), not the number of time points
ratss <- ratsl %>% group_by(Group, WD) %>% summarise(mean = mean(weight), se = sd(weight)/sqrt(n())) %>% ungroup()

# glimpse(ratss)

# Plot the mean profiles
ggplot(ratss, aes(x = WD, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")

We can also compare the average weights of the groups with boxplots, as below. The first-day observation was removed from the data. The boxplots show results similar to those discussed above, but now we can see more clearly that each group contains outliers.

# Create summary data by group and subject, with the mean weight as the summary variable
ratsl8s <- ratsl %>%
  filter(WD > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(weight)) %>%
  ungroup()

# Glimpse the data
# glimpse(ratsl8s)

# Draw a boxplot of the mean versus group
ggplot(ratsl8s, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), time 0-60")

When we filter out the outliers, the variance decreases in each group. For Group 1, the mean grows a bit (from around 265 to 275). For Group 2, the mean decreases from around 490 to 450. For Group 3, the mean increases from around 530 to 540. Now we can see more clearly that the weights of the individuals in Groups 2 and 3 are on different levels.

# Filtering the outliers
ratsl8s_1 <- ratsl8s %>% filter(Group == 3)
ratsl8s_1 <- ratsl8s_1 %>% filter(mean > 500)
ratsl8s_2 <- ratsl8s %>% filter(Group !=3)
ratsl8s_2 <- ratsl8s_2 %>% filter(mean>250) %>% filter(mean < 500)
ratsl8s_filtered <- rbind(ratsl8s_2, ratsl8s_1)

ggplot(ratsl8s_filtered, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), whole study period")
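Instead of hard-coding the filtering thresholds as above, the boxplot outliers can be flagged with the standard 1.5 × IQR rule (a sketch; ratsl8s_demo is a hypothetical stand-in for the per-rat summary data, with illustrative values):

```r
library(dplyr)

# Hypothetical stand-in for the per-rat summary data (values are illustrative)
ratsl8s_demo <- data.frame(
  Group = factor(rep(1:3, times = c(8, 4, 4))),
  mean  = c(275, 240, 260, 262, 266, 268, 270, 264,   # Group 1: 240 is a low outlier
            435, 440, 445, 580,                       # Group 2: 580 is a high outlier
            520, 525, 530, 496)                       # Group 3: no outliers here
)

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] within each group
flag_outliers <- function(x) x < quantile(x, 0.25) - 1.5 * IQR(x) |
                             x > quantile(x, 0.75) + 1.5 * IQR(x)

ratsl8s_demo %>% group_by(Group) %>% filter(flag_outliers(mean)) %>% ungroup()
```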

To analyse the differences between the groups in more detail, we perform two-sided two-sample t-tests. Because a t-test compares only two groups at a time, we run it separately on each of the three group pairs.

We can see that Group 1 differs clearly from both Group 2 and Group 3: the p-values are far below the 0.05 significance level. However, the difference between Groups 2 and 3 is not statistically significant (p = 0.3263), so based on mean weight alone we cannot conclude that these two groups differ.

ratsl8s12 <- ratsl8s %>% filter(Group != 3)
t.test(mean ~ Group, data = ratsl8s12, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -9.0646, df = 10, p-value = 3.88e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -277.5345 -168.0155
## sample estimates:
## mean in group 1 mean in group 2 
##         265.025         487.800
ratsl8s23 <- ratsl8s %>% filter(Group != 1)
t.test(mean ~ Group, data = ratsl8s23, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -1.0686, df = 6, p-value = 0.3263
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -130.60428   51.20428
## sample estimates:
## mean in group 2 mean in group 3 
##           487.8           527.5
ratsl8s13 <- ratsl8s %>% filter(Group != 2)
t.test(mean ~ Group, data = ratsl8s13, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3 
##         265.025         527.500
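Running three separate t-tests inflates the family-wise error rate; pairwise.t.test with a Bonferroni correction handles all pairs at once (a sketch using hypothetical stand-in data in place of the per-rat means in ratsl8s above):

```r
# Hypothetical per-rat mean weights (illustrative values, first day excluded)
grp <- factor(rep(1:3, times = c(8, 4, 4)))
mw  <- c(265, 240, 262, 264, 266, 268, 270, 275,
         435, 440, 445, 480,
         495, 520, 525, 530)

# Pools the within-group SD by default and Bonferroni-adjusts the three p-values
pairwise.t.test(mw, grp, p.adjust.method = "bonferroni")
```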

Part II: Linear Mixed Effects Models for Normal Response Variables

Now we move our focus to the BPRS data. Forty individuals were randomized to treatment group 1 (n=20) or treatment group 2 (n=20), and their bprs value was measured at baseline (week 0) and then weekly for eight weeks.

BPRS
##    treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1          1       1    42    36    36    43    41    40    38    47    51
## 2          1       2    58    68    61    55    43    34    28    28    28
## 3          1       3    54    55    41    38    43    28    29    25    24
## 4          1       4    55    77    49    54    56    50    47    42    46
## 5          1       5    72    75    72    65    50    39    32    38    32
## 6          1       6    48    43    41    38    36    29    33    27    25
## 7          1       7    71    61    47    30    27    40    30    31    31
## 8          1       8    30    36    38    38    31    26    26    25    24
## 9          1       9    41    43    39    35    28    22    20    23    21
## 10         1      10    57    51    51    55    53    43    43    39    32
## 11         1      11    30    34    34    41    36    36    38    36    36
## 12         1      12    55    52    49    54    48    43    37    36    31
## 13         1      13    36    32    36    31    25    25    21    19    22
## 14         1      14    38    35    36    34    25    27    25    26    26
## 15         1      15    66    68    65    49    36    32    27    30    37
## 16         1      16    41    35    45    42    31    31    29    26    30
## 17         1      17    45    38    46    38    40    33    27    31    27
## 18         1      18    39    35    27    25    29    28    21    25    20
## 19         1      19    24    28    31    28    29    21    22    23    22
## 20         1      20    38    34    27    25    25    27    21    19    21
## 21         2       1    52    73    42    41    39    38    43    62    50
## 22         2       2    30    23    32    24    20    20    19    18    20
## 23         2       3    65    31    33    28    22    25    24    31    32
## 24         2       4    37    31    27    31    31    26    24    26    23
## 25         2       5    59    67    58    61    49    38    37    36    35
## 26         2       6    30    33    37    33    28    26    27    23    21
## 27         2       7    69    52    41    33    34    37    37    38    35
## 28         2       8    62    54    49    39    55    51    55    59    66
## 29         2       9    38    40    38    27    31    24    22    21    21
## 30         2      10    65    44    31    34    39    34    41    42    39
## 31         2      11    78    95    75    76    66    64    64    60    75
## 32         2      12    38    41    36    27    29    27    21    22    23
## 33         2      13    63    65    60    53    52    32    37    52    28
## 34         2      14    40    37    31    38    35    30    33    30    27
## 35         2      15    40    36    55    55    42    30    26    30    37
## 36         2      16    54    45    35    27    25    22    22    22    22
## 37         2      17    33    41    30    32    46    43    43    43    43
## 38         2      18    28    30    29    33    30    26    36    33    30
## 39         2      19    52    43    26    27    24    32    21    21    21
## 40         2      20    47    36    32    29    25    23    23    23    23
dim(BPRS)
## [1] 40 11
summary(BPRS)
##  treatment    subject       week0           week1           week2     
##  1:20      1      : 2   Min.   :24.00   Min.   :23.00   Min.   :26.0  
##  2:20      2      : 2   1st Qu.:38.00   1st Qu.:35.00   1st Qu.:32.0  
##            3      : 2   Median :46.00   Median :41.00   Median :38.0  
##            4      : 2   Mean   :48.00   Mean   :46.33   Mean   :41.7  
##            5      : 2   3rd Qu.:58.25   3rd Qu.:54.25   3rd Qu.:49.0  
##            6      : 2   Max.   :78.00   Max.   :95.00   Max.   :75.0  
##            (Other):28                                                 
##      week3           week4           week5           week6      
##  Min.   :24.00   Min.   :20.00   Min.   :20.00   Min.   :19.00  
##  1st Qu.:29.75   1st Qu.:28.00   1st Qu.:26.00   1st Qu.:22.75  
##  Median :36.50   Median :34.50   Median :30.50   Median :28.50  
##  Mean   :39.15   Mean   :36.35   Mean   :32.55   Mean   :31.23  
##  3rd Qu.:44.50   3rd Qu.:43.00   3rd Qu.:38.00   3rd Qu.:37.00  
##  Max.   :76.00   Max.   :66.00   Max.   :64.00   Max.   :64.00  
##                                                                 
##      week7          week8      
##  Min.   :18.0   Min.   :20.00  
##  1st Qu.:23.0   1st Qu.:22.75  
##  Median :30.0   Median :28.00  
##  Mean   :32.2   Mean   :31.43  
##  3rd Qu.:38.0   3rd Qu.:35.25  
##  Max.   :62.0   Max.   :75.00  
## 
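The long-form data frame BPRSL explored next was presumably created by gathering the week columns of BPRS into one bprs column; a sketch using dplyr and tidyr (the exact call is an assumption, chosen to match the weeks/week variables seen in the output below):

```r
library(dplyr)
library(tidyr)

# Gather the weekN columns into key-value pairs (weeks, bprs),
# then extract the numeric week from the column name
BPRSL <- BPRS %>%
  gather(key = weeks, value = bprs, week0:week8) %>%
  mutate(week = as.integer(substr(weeks, 5, 5)))
```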
dim(BPRSL)
## [1] 360   5
names(BPRSL)
## [1] "treatment" "subject"   "weeks"     "bprs"      "week"
str(BPRSL)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
summary(BPRSL)
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0"...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

We first take a look at the data by plotting the bprs values against week. We do not draw connecting lines but label each observation with its treatment group. The individuals standing out with the highest values belong to group 2 in almost every week; otherwise the responses of the two groups overlap strongly. The bprs values appear to decrease over time in both groups. There also seems to be a small increase during the last week, but we cannot yet say whether it is a real effect.

ggplot(BPRSL, aes(x = week, y = bprs)) + geom_text(aes(label = treatment)) + scale_x_continuous(name = "Time (weeks)") + scale_y_continuous(name = "BPRS") + theme(legend.position = "top")

When examining the individual response profiles by treatment group, we do not see anything new in the data. The groups do not seem very different, apart from the observation already made in the previous plot: the individuals with the highest values belong to group 2 in almost every week.

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Now we analyse the differences between the groups by building a linear regression model with week, treatment, and their interaction week:treatment as explanatory variables. Note that an ordinary linear model ignores the dependence between repeated measurements on the same subject. Only week turned out statistically significant; according to these results, we can hypothesise that the effects of the two treatments did not differ from each other.

# building a linear regression model
BPRS_model <- lm(data = BPRSL, bprs ~ week + treatment + week*treatment)
summary(BPRS_model)
## 
## Call:
## lm(formula = bprs ~ week + treatment + week * treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.886  -9.267  -2.801   6.513  51.318 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      47.8856     1.6970  28.218  < 2e-16 ***
## week             -2.6283     0.3564  -7.374 1.17e-12 ***
## treatment2       -2.2911     2.3999  -0.955    0.340    
## week:treatment2   0.7158     0.5041   1.420    0.156    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.35 on 356 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1829 
## F-statistic: 27.78 on 3 and 356 DF,  p-value: 3.629e-16
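To back up the significance statements above, confidence intervals for the lm coefficients can be inspected; this is a small illustrative step, not part of the original analysis:

```r
# 95% confidence intervals for the regression coefficients;
# the intervals for treatment2 and week:treatment2 should cover zero,
# matching their non-significant p-values in the summary
confint(BPRS_model, level = 0.95)
```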

Now we fit a more appropriate model for this kind of repeated-measures data, the random intercept model. We use week and treatment as fixed-effect terms and (1 | subject) as the random-effect term, allowing each subject to have their own intercept.

# fitting the model requires the lme4 package
library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Next we extend this to a random intercept and random slope model: week and treatment remain fixed-effect terms, and (week | subject) allows both the intercept and the slope of week to vary between subjects.

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8202  8.0511        
##           week         0.9608  0.9802   -0.51
##  Residual             97.4307  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

A likelihood ratio test (ANOVA) between the two models shows that the random intercept and random slope model fits the data better: the chi-squared statistic is 7.27 with 2 degrees of freedom and p = 0.026, and the model also has lower AIC and deviance than the random intercept model.

anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
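As a final sketch, the fitted values of the preferred random intercept and random slope model could be plotted against the observed profiles; this is an assumption about the natural next step, mirroring the earlier profile plot:

```r
# Add the fitted values of the chosen model to the long data
BPRSL$fitted <- fitted(BPRS_ref1)

# Plot the fitted bprs profiles by subject, one panel per treatment group
ggplot(BPRSL, aes(x = week, y = fitted, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(name = "Fitted BPRS")
```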